
Support for additional compressors/decompressors #7978

Closed
wants to merge 9 commits

Conversation

rvrangel
Contributor

@rvrangel rvrangel commented Apr 28, 2021

Signed-off-by: Renan Rangel [email protected]

Description

This change will allow us to use an external decompressor when available, as opposed to the internal pgzip currently used. We ran some tests using pigz -d -c and got a nice speed bump: around 30% or more faster decompression when restoring from a backup in S3. While decompression is still single-threaded in pigz, it uses separate threads for reading, writing, and checksumming, which results in faster overall performance.

I wanted to get initial feedback on this change, as we would also like to plug in our own compressor to use other compression algorithms (like lz4 or zstd) via an external binary, since this is a more flexible setup for users, but also work on the change to add built-in support for one of these compression algorithms.

Related Issue(s)

#7802

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

This should not have any effect on current deployments. To take advantage of this, one needs to pass the correct flags to vttablet, and the command also needs to exist in the OS; otherwise it will fall back to the builtin decompressor.

@rvrangel rvrangel force-pushed the external-decompressor branch from 66c7b9c to 4becf1b on April 28, 2021 09:54
@rafael rafael requested a review from deepthi April 28, 2021 23:35
Member

@rafael rafael left a comment


Let's get @deepthi input on this.

@deepthi
Member

deepthi commented May 5, 2021

Sorry for the delay in responding. I somehow missed the notification for reviewing this PR.
I am not opposed to introducing more options for decompressing the backup file. However, it would be good to make the UX slightly different from what is proposed in this PR.
We can assume that we will have two classes of decompressors:

  1. Pure golang which can be plugged in directly. We currently use pgzip but there is a proposal to either replace it with zstd or provide zstd as an alternative.
  2. External binaries for which we need a path and we use exec.Command to run them.

To start with we only need one flag - let us call it xtrabackup_external_decompressor. If that is specified, we use it; otherwise we use the builtin. In the future, if we decide to support multiple builtin flavors, we can add another flag.
Some thought should be given to whether to expect the binary to exist on the path, or to expect the flag value to contain the full path.

@deepthi
Member

deepthi commented May 24, 2021

Things have changed a little bit now (see detail in #8175), so the design will need to change to accommodate a choice of builtin decompressors also.

@rvrangel
Contributor Author

@deepthi I saw that, I will update this PR to take it into account, thanks!

@rvrangel
Contributor Author

rvrangel commented Jun 3, 2021

hi @deepthi, I have updated the code with support for the two classes of compressors and decompressors:

  • external compressor/decompressor:
    • as you mentioned, I updated it to use xtrabackup_external_decompressor and also added two new flags, xtrabackup_external_compressor and xtrabackup_external_compressor_extension, to control which command to use for the external compressor (along with the file extension to use)
  • builtin compressors/decompressors:
    • users should be able to use xtrabackup_builtin_compressor to choose a different compressor (defaults to pgzip), and they can also specify a decompressor with xtrabackup_builtin_decompressor; by default it will try to figure out which decompressor to use if one is not specified
    • I added support for lz4 and zstd, both in pure Go. I noticed there is a cgo version of the zstd library which was faster in my tests; we can also look into adding support for it.

@deepthi
Member

deepthi commented Jun 3, 2021


This sounds good. Let us not add cgo right now; we want to avoid that if possible.
LMK when the PR is ready to review, and at that time I can also run the CI for you.

@rvrangel rvrangel marked this pull request as ready for review June 8, 2021 11:50
@rvrangel
Contributor Author

rvrangel commented Jun 8, 2021

@deepthi It should be ready for review now 👍, I would love to get your feedback on it

}
}

return "", fmt.Errorf("%w \"%s\"", errUnsupportedCompressionExtension, extension)
Collaborator

Anywhere where you have \"%s\" in a format string, you can replace it with %q, which adds the quotes automatically and escapes the contents of the string so they always look proper inside the quotes. 👌

Contributor Author

neat, I will change that!

@vmg
Collaborator

vmg commented Jun 8, 2021

Hiiii @rvrangel! Been looking at this PR. I'm afraid that your usage of Exec and pipes in general is not correct here and may lead to data corruption.

From the exec package docs:

StdoutPipe returns a pipe that will be connected to the command's standard output when the command starts.

Wait will close the pipe after seeing the command exit, so most callers need not close the pipe themselves. It is thus incorrect to call Wait before all reads from the pipe have completed. For the same reason, it is incorrect to call Run when using StdoutPipe.

You can see that this API in the stdlib is very nuanced. Wait will close the pipe as soon as the program has exited, and if we haven't read everything that has been buffered so far, we'd be dropping data.

I would recommend refactoring this implementation so that all io.ReadCloser and io.WriteCloser that are returned when creating a compression/decompression engine, whether it's builtin or external, have identical semantics. For the builtin types, the returned types can be the (de)compressor for their respective implementations, but for the externals, you probably want to wrap the Command as a whole plus its pipes, so that reads/writes are going through the pipe, and Close actually performs the Wait on the subprocess. This will also cleanup the callsite for this interfaces: right now the (de)compressors are always Closers, but the Close method must only be called for builtins, which makes the interface unsafe to use.

Let me know if you get stuck!

}

func prepareExternalCompressionCmd(ctx context.Context, cmdStr string) (*exec.Cmd, error) {
cmdArgs := strings.Split(cmdStr, " ")
Collaborator

This isn't perfectly cromulent; it will corrupt argument strings with quotes or escaped whitespace. I would suggest bringing https://pkg.go.dev/github.com/google/shlex as a (tiny) dependency to parse the commandline properly.

Contributor Author

Sounds good, I will take a look at it and make the changes 👍

@rvrangel
Contributor Author

rvrangel commented Jun 8, 2021

hi @vmg, thanks for the feedback. It is correct that we should not call Wait() when using StdoutPipe(). In this code, though, that only happens for the compressor, which uses StdinPipe(), not StdoutPipe(); the Wait() is required to make sure everything we have written to the external process has been processed and written to its STDOUT, to avoid corrupting data. Nevertheless, wrapping everything so both are treated the same way makes sense, so I will make the changes and let you know.

@vmg
Collaborator

vmg commented Jun 8, 2021

I see now, I was a bit confused with the WaitGroup usage. I think if you wrap the Command and pipes, you should be able to run both compression and decompression with just a single goroutine (the logger), which should make it much easier to follow. Let's see what it looks like!

@rvrangel
Contributor Author

rvrangel commented Jun 9, 2021

@vmg I updated the PR, let me know if that covers most of what you thought and if you have any additional feedback

Collaborator

@vmg vmg left a comment


Very nice! Looks like a good start! Once @deepthi approves this we'll merge and I'll run some profiles to see how it performs in practice. Thank you!

Signed-off-by: Renan Rangel <[email protected]>
@rvrangel
Contributor Author

and thanks for the feedback @vmg! As a heads up, one limiting factor we encountered is the download speed for restores when using S3: because GetObject is used, we can't take advantage of multipart downloads the same way we do for uploads, which limits download speeds to <100MB/s:

// ReadFile is part of the backupstorage.BackupHandle interface.
func (bh *S3BackupHandle) ReadFile(ctx context.Context, filename string) (io.ReadCloser, error) {
	if !bh.readOnly {
		return nil, fmt.Errorf("ReadFile cannot be called on read-write backup")
	}
	object := objName(bh.dir, bh.name, filename)
	out, err := bh.client.GetObject(&s3.GetObjectInput{
		Bucket:               bucket,
		Key:                  object,
		SSECustomerAlgorithm: bh.bs.s3SSE.customerAlg,
		SSECustomerKey:       bh.bs.s3SSE.customerKey,
		SSECustomerKeyMD5:    bh.bs.s3SSE.customerMd5,
	})
	if err != nil {
		return nil, err
	}
	return out.Body, nil
}

I have some code that implements multipart download into an in-memory buffer; it still needs some polishing, but I will create another PR soon. We saw improvements in the 2-4x range when using other algorithms, since pgzip is our main limiting factor at the moment. It may be helpful to test this PR against another backend that can currently provide better download speeds (or maybe the file storage?) when you are testing it.

Contributor

@ajm188 ajm188 left a comment


just need the copyright notices, then good from my end.

@@ -0,0 +1,281 @@
package mysqlctl
Contributor

this file is missing the copyright notice

@@ -0,0 +1,199 @@
package mysqlctl
Contributor

ditto on copyright notice

Signed-off-by: Renan Rangel <[email protected]>
@deepthi deepthi added the Type: Enhancement Logical improvement (somewhere between a bug and feature) label Jun 10, 2021
Member

@deepthi deepthi left a comment


Is there a reason the same support is not being added to the builtin engine? The compression options should be independent of the engine being used.
Also, more tests are needed. It should be fine to overload existing backup tests so that different tests use different options.
unit tests: https://github.com/vitessio/vitess/blob/main/go/vt/wrangler/testlib/backup_test.go
endtoend: https://github.com/vitessio/vitess/tree/main/go/test/endtoend/backup

// switch which compressor/decompressor to use
builtinCompressor = flag.String("xtrabackup_builtin_compressor", "pgzip", "which builtin compressor engine to use")
builtinDecompressor = flag.String("xtrabackup_builtin_decompressor", "auto", "which builtin decompressor engine to use")
// use and external command to decompress the backups
Member

typo: and -> an

Member

While adding support to builtinbackupengine, all these flags should be renamed so that they are not xtrabackup specific.

Contributor Author

hi @deepthi, thanks for the feedback. I will add support for it in the builtin engine this week, along with the tests 👍

@@ -63,6 +61,13 @@ var (
// striping mode
xtrabackupStripes = flag.Uint("xtrabackup_stripes", 0, "If greater than 0, use data striping across this many destination files to parallelize data transfer and decompression")
xtrabackupStripeBlockSize = flag.Uint("xtrabackup_stripe_block_size", 102400, "Size in bytes of each block that gets sent to a given stripe before rotating to the next stripe")
// switch which compressor/decompressor to use
builtinCompressor = flag.String("xtrabackup_builtin_compressor", "pgzip", "which builtin compressor engine to use")
builtinDecompressor = flag.String("xtrabackup_builtin_decompressor", "auto", "which builtin decompressor engine to use")
Member

It will be nice to add some help text on what "auto" means.

@rvrangel rvrangel requested a review from mattlord as a code owner March 24, 2022 17:06
@rvrangel rvrangel changed the title Support for external decompressor Support for additional compressors/decompressors Mar 24, 2022
@deepthi deepthi added the release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) label Mar 31, 2022
@deepthi
Member

deepthi commented Mar 31, 2022

@rvrangel when you get back to this, we will also need a companion website PR to document the new options.

@deepthi
Member

deepthi commented Jul 12, 2022

Superseded by #10558 which has been merged.

@deepthi deepthi closed this Jul 12, 2022
Labels
Component: Cluster management release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) Type: Enhancement Logical improvement (somewhere between a bug and feature)